library(tidyverse) # for graphing and data cleaning
library(gardenR) # for Lisa's garden data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
theme_set(theme_minimal()) # My favorite ggplot() theme :)
# Lisa's garden data
data("garden_harvest")
# Seeds/plants (and other garden supply) costs
data("garden_spending")
# Planting dates and locations
data("garden_planting")
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to RStudio. To do that, you should read the instructions (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
Include keep_md: TRUE in the YAML header (an example header is shown below). The .md file is a markdown (NOT R Markdown) file that is an interim step in creating the html file. Markdown files display fairly nicely on GitHub, so we want to keep the .md file and look at it there. Click the boxes next to these two files, commit the changes (remember to include a commit message), and push them (green up arrow). Put your name at the top of the document.
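For reference, a YAML header with this option might look like the following (the title is just a placeholder):

---
title: "Weekly Exercises"
output:
  html_document:
    keep_md: TRUE
---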
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
Summarize the garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week (HINT: use the wday() function from lubridate). Display the results so that the vegetables are rows but the days of the week are columns.

garden_harvest %>%
select(vegetable, date, weight) %>%
mutate(day = wday(date, label = TRUE)) %>%
group_by(vegetable, day) %>%
summarize(total_weight_lbs = sum(weight)*0.00220462) %>% # convert grams to pounds
pivot_wider(id_cols = vegetable,
names_from = day,
values_from = total_weight_lbs)
Use the garden_harvest data to find the total harvest in pounds for each vegetable variety and then try adding the plot from the garden_planting table. This will not turn out perfectly. What is the problem? How might you fix it?

I began by summarizing the original garden_harvest data set to calculate the total harvest in lbs for each vegetable variety, which I named garden_harvest2. I then selected only the essential columns from the garden_planting data set and named the result selected_garden_planting. Finally, I did a full join by vegetable and variety to add plot to garden_harvest2. Problems arise when a variety was planted in multiple plots: the full harvest weight is then attributed to each of those plots, rather than the proportionate amount from each. One possible fix is sketched after the code below.
garden_harvest2 <- garden_harvest %>%
group_by(variety, vegetable) %>%
summarize(total_weight = sum(weight)) %>%
mutate(total_weight_lbs = total_weight*0.00220462)
selected_garden_planting <- garden_planting %>%
select(vegetable, variety, plot)
garden_harvest2 %>%
full_join(selected_garden_planting,
by = c("vegetable", "variety"))
Use the garden_harvest and garden_spending datasets, along with price data from somewhere like this, to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.

The first step I would take is to simplify the garden_harvest data by finding the total accumulated harvest for each vegetable and variety combination, so that there is only one row per combination. I would then use mutate() to add a variable that approximates how much it would cost to buy the amount of produce that Lisa grew and collected herself, using the outside price data. Next, I would inner_join() that data to garden_spending by the variables “vegetable” and “variety”. Finally, I would mutate() one more variable that takes the estimated store price of Lisa’s produce for each vegetable variety and subtracts what Lisa paid for the seeds. This gives the difference between what the garden cost Lisa and what the same produce would cost at the store, i.e., Lisa’s savings.
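A minimal sketch of that plan, assuming a hypothetical price lookup table veggie_prices (with columns vegetable, variety, and price_per_lb) built from the outside price data, and assuming the seed cost column in garden_spending is price_with_tax:

# veggie_prices is a hypothetical table; it is NOT part of the gardenR package
harvest_value <- garden_harvest %>%
  group_by(vegetable, variety) %>%
  summarize(total_lbs = sum(weight)*0.00220462) %>%
  left_join(veggie_prices,
            by = c("vegetable", "variety")) %>%
  mutate(store_value = total_lbs*price_per_lb)

harvest_value %>%
  inner_join(garden_spending,
             by = c("vegetable", "variety")) %>%
  mutate(savings = store_value - price_with_tax) # price_with_tax assumed to be the seed cost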
# Earliest planting date for each vegetable variety
plant_min_date <- garden_planting %>%
group_by(vegetable, variety) %>%
summarize(earliest_planting = min(date))
# Attach the earliest planting date to each harvest record
days_of_planting_and_harvest <- garden_harvest %>%
left_join(plant_min_date,
by = c("vegetable", "variety")) %>%
na.omit()
# Days from first planting to first harvest for each variety
days_to_harvest <- days_of_planting_and_harvest %>%
group_by(vegetable, variety) %>%
mutate(day_diff = date - earliest_planting) %>%
summarize(plant_to_harvest = min(day_diff)) %>%
arrange(plant_to_harvest)
days_to_harvest
In the garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().

garden_harvest %>%
select(vegetable, variety) %>%
mutate(variety_length = str_length(variety)) %>%
group_by(vegetable, variety) %>%
summarize(variety_length_final = min(variety_length)) %>%
mutate(variety_lower = str_to_lower(variety)) %>%
arrange(vegetable, variety_length_final)
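An equivalent version that uses distinct(), as the hint suggests (a sketch):

garden_harvest %>%
  distinct(vegetable, variety) %>%
  mutate(variety_lower = str_to_lower(variety),
         variety_length = str_length(variety)) %>%
  arrange(vegetable, variety_length)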
In the garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().

garden_harvest %>%
mutate(has_er_ar = str_detect(variety, "er|ar")) %>%
filter(has_er_ar) %>%
distinct(vegetable, variety)
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
Two data tables are available:
- Trips contains records of individual rentals
- Stations gives the locations of the bike rental stations

Here is the code to read in the data. We do this a little differently than usual, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations <- read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
Make a density plot of the events versus sdate. Use geom_density().

Trips %>%
ggplot(aes(x = sdate)) +
geom_density() +
theme(axis.text.y = element_blank()) +
labs(title = "Density of Bicycle Use By Date", x = "", y = "")
Bicycle use decreased as the winter months approached and then arrived, which may be attributed to worsening weather conditions.
Make a density plot of the events versus time of day. Use mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.

Trips %>%
mutate(hour2 = hour(sdate),
minute2 = minute(sdate),
time_of_day = hour2 + minute2/60) %>%
ggplot(aes(x = time_of_day)) +
geom_density() +
labs(title = "Bike Rental by Time of Day (hrs)", x = "", y = "") +
theme(axis.text.y = element_blank())
The heaviest bike usage occurs at around 8am and again from 5pm to 6pm. This is interesting, since the pattern likely reflects commuting around a 9am - 5pm work day.
Trips %>%
mutate(week_day = wday(sdate, label = TRUE)) %>%
ggplot(aes(y = week_day)) +
geom_bar() +
labs(title = "Usage of Bicycles by Day of the Week", x = "", y = "")
This bar graph gives us insight into the usage of bikes by weekday. The data makes it clear that the bikes were used less on weekends than they were during the week.
Trips %>%
mutate(week_day = wday(sdate, label = TRUE),
hour2 = hour(sdate),
minute2 = minute(sdate),
time_of_day = hour2 + minute2/60) %>%
ggplot(aes(x = time_of_day)) +
geom_density() +
labs(title = "Bike Rental by Time of Day (hrs)", x = "", y = "") +
theme(axis.text.y = element_blank()) +
facet_wrap(vars(week_day))
The faceted distribution of bike rentals above does an even better job of highlighting the impact of the average work commute. During the work week the most popular times are 8am and 5pm, while on weekends usage is dispersed over a larger period in the afternoon.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (level Casual). The next set of exercises investigates whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises.
Repeat the previous plot, but map the fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color = NA to suppress the outline of the density function.

Trips %>%
mutate(week_day = wday(sdate, label = TRUE),
hour2 = hour(sdate),
minute2 = minute(sdate),
time_of_day = hour2 + minute2/60) %>%
ggplot(aes(x = time_of_day)) +
geom_density(aes(fill = client), alpha = .5, color = NA) +
labs(title = "Bike Rental by Time of Day (hrs) and Type of Client", x = "", y = "") +
theme(axis.text.y = element_blank()) +
facet_wrap(vars(week_day))
This new “fill”ed graph shows a density plot for each type of client, exposing a substantial difference in usage patterns. During the workweek, registered clients are relatively inactive from 10am - 3pm, which is exactly when casual clients are most active.
Repeat the previous plot, but add position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?

Trips %>%
mutate(week_day = wday(sdate, label = TRUE),
hour2 = hour(sdate),
minute2 = minute(sdate),
time_of_day = hour2 + minute2/60) %>%
ggplot(aes(x = time_of_day)) +
geom_density(aes(fill = client), alpha = .5, color = NA, position = position_stack()) +
labs(title = "Bike Rental by Time of Day (hrs) and Type of Client", x = "", y = "") +
theme(axis.text.y = element_blank()) +
facet_wrap(vars(week_day))
This stacking option graphs the registered density and then adds the casual users’ data on top of it. The disadvantage of this type of graph is that the difference in usage patterns between the two groups is harder to see. This may be because there are more registered users, or because they ride more consistently and more often. On the other hand, the stacked graph is useful because it helps us understand how each client subset contributes to the density of the total bike-user population.
Return to the unstacked version (without position = position_stack()). Add a new variable to the dataset called weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.

Trips %>%
mutate(week_day = wday(sdate, label = TRUE),
hour2 = hour(sdate),
minute2 = minute(sdate),
time_of_day = hour2 + minute2/60,
type_of_day = ifelse(wday(sdate) %in% c(1,7), "weekend", "weekday")) %>%
ggplot(aes(x = time_of_day)) +
geom_density(aes(fill = client), alpha = .5, color = NA) +
labs(title = "Bike Rental by Time of Day (hrs), Type of Client, and Type of Day", x = "", y = "") +
theme(axis.text.y = element_blank()) +
facet_wrap(vars(type_of_day))
Repeat the previous plot, but now facet on client and fill with the weekend/weekday variable. What information does this graph tell you that the previous didn’t? Is one graph better than the other?

Trips %>%
mutate(week_day = wday(sdate, label = TRUE),
hour2 = hour(sdate),
minute2 = minute(sdate),
time_of_day = hour2 + minute2/60,
type_of_day = ifelse(wday(sdate) %in% c(1,7), "weekend", "weekday")) %>%
ggplot(aes(x = time_of_day)) +
geom_density(aes(fill = type_of_day), alpha = .5, color = NA) +
labs(title = "Bike Rental by Time of Day (hrs), Type of Client, and Type of Day", x = "", y = "") +
theme(axis.text.y = element_blank()) +
facet_wrap(vars(client))
This graph better compares the shift within each client type between the two parts of the week. It is very easy to see how the most popular usage times shift for registered users on weekends.
Use the latitude and longitude variables in Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!

Trips %>%
count(sstation) %>%
inner_join(Stations,
by = c("sstation" = "name")) %>%
ggplot(aes(x = lat, y = long, color = n)) +
geom_point() +
labs(title = "Total Departures From Each Station Represented by Their Longitude and Latitude", x = "Latitude", y = "Longitude")
After mapping color to the number of departures from each station, it becomes clear that the majority of the bike usage comes from one central area.
Trips %>%
group_by(sstation) %>%
summarize(total_departures = n(),
prop = mean(client == "Casual")) %>%
inner_join(Stations,
by = c("sstation" = "name")) %>%
ggplot(aes(x = lat, y = long, color = prop)) +
geom_point() +
labs(title = "Stations with the Highest Casual User Percentage by Longitude and Latitude", x = "Latitude", y = "Longitude")
While certain stations within the biggest cluster have a higher proportion of casual users, there are also stations on the outskirts with a high percentage. There also appears to be a relationship between the proportion of casual users at a station and the overall number of departures from that station.
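A quick way to check that claimed relationship numerically (a sketch, computing the correlation between a station’s total departures and its proportion of casual users):

Trips %>%
  group_by(sstation) %>%
  summarize(total_departures = n(),
            prop_casual = mean(client == "Casual")) %>%
  summarize(correlation = cor(total_departures, prop_casual))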
Find the top 10 station-date combinations with the most departures. HINT: as_date(sdate) converts sdate from date-time format to date format.

Top_departure_dates <- Trips %>%
mutate(date = as_date(sdate)) %>%
group_by(sstation, date) %>%
summarize(num_departures = n()) %>%
arrange(desc(num_departures)) %>%
head(n = 10)
Top_departure_dates
Top_departure_dates %>%
left_join(Trips %>% mutate(date = as_date(sdate)), # add a date column to Trips so the join keys match
by = c("sstation", "date"))
Top_departure_dates %>%
inner_join(Trips %>% mutate(date = as_date(sdate)), # add a date column to Trips so the join keys match
by = c("sstation", "date")) %>%
mutate(week_day = wday(sdate, label = TRUE)) %>%
group_by(client, week_day) %>%
summarize(client_departures = n()) %>%
group_by(client) %>%
mutate(prop = client_departures/sum(client_departures)) %>%
pivot_wider(id_cols = week_day,
names_from = client,
values_from = prop)
DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
Recreate the graph below using facet_geo(). The graphic won’t load below since it came from a location on my computer, so you’ll have to reference the original html on the moodle page to see it. A sketch of one possible approach follows at the end of this document.

DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?
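One way a facet_geo() recreation of the exercise above might start (a sketch only: the choice of the lib variable, the inf_adj_perchild measure, and the grid are assumptions about the target plot):

kids %>%
  filter(variable == "lib") %>% # public library spending
  ggplot(aes(x = year, y = inf_adj_perchild)) +
  geom_line() +
  facet_geo(~ state, grid = "us_state_grid2") + # US-map-shaped facet layout
  labs(title = "Change in public library spending per child over time",
       x = "", y = "")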